Skip to content

Conversation

spilchen
Copy link
Contributor

Long-running INSPECT statements (e.g., >3h) have previously failed due to connection resets, despite no server crash. This likely stems from a missed keep-alive or a client timeout, though the exact cause is unclear.

This change updates the roachtest to:

  • Run the INSPECT statement as a background job using a short statement_timeout (5s).
  • Poll the job table for job completion instead of waiting synchronously.
  • Report progress at 10% intervals to improve visibility without overwhelming the logs.

Fixes #155610

Release note: none

Epic: none

@spilchen spilchen self-assigned this Oct 17, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@spilchen spilchen force-pushed the gh-155610/251017/1508/inspect-long-run/pr-ready branch from 3a5fe19 to a2a23a2 Compare October 20, 2025 11:26
@spilchen spilchen marked this pull request as ready for review October 20, 2025 11:26
@spilchen spilchen requested review from a team and bghal October 20, 2025 11:26
Copy link
Contributor

@bghal bghal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bghal reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spilchen)


pkg/cmd/roachtest/tests/inspect_throughput.go line 304 at r1 (raw file):

) (jobID int64) {
	// Set a short statement timeout to force INSPECT to run as a background job.
	if _, err := db.Exec("SET statement_timeout = '5s'"); err != nil {

Nit: Seems like the require package can be used in roachtests.


pkg/cmd/roachtest/tests/inspect_throughput.go line 311 at r1 (raw file):

	// Reset statement timeout.
	if _, resetErr := db.Exec("RESET statement_timeout"); resetErr != nil {

Does this need to break up the error checking? Could it be deferred?


pkg/cmd/roachtest/tests/inspect_throughput.go line 316 at r1 (raw file):

	// We expect a statement timeout error, which is normal.
	if err != nil && !strings.Contains(err.Error(), "statement timeout") {

Nit: Similarly require.ErrorContains would be nice.


pkg/cmd/roachtest/tests/inspect_throughput.go line 361 at r1 (raw file):

			// Check if job is complete.
			if status == "succeeded" {

Nit: Perhaps a switch on jobs.StatusMessage?

Long-running INSPECT statements (e.g., >3h) have previously failed due
to connection resets, despite no server crash. This likely stems from a
missed keep-alive or a client timeout, though the exact cause is
unclear.

This change updates the roachtest to:
- Run the INSPECT statement as a background job using a short
  statement_timeout (5s).
- Poll the job table for job completion instead of waiting
  synchronously.
- Report progress at 10% intervals to improve visibility without
  overwhelming the logs.

Fixes cockroachdb#155610

Release note: none

Epic: none
@spilchen spilchen force-pushed the gh-155610/251017/1508/inspect-long-run/pr-ready branch from a2a23a2 to f8b539b Compare October 21, 2025 17:21
Copy link
Contributor Author

@spilchen spilchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bghal)


pkg/cmd/roachtest/tests/inspect_throughput.go line 304 at r1 (raw file):

Previously, bghal wrote…

Nit: Seems like the require package can be used in roachtests.

Good idea.


pkg/cmd/roachtest/tests/inspect_throughput.go line 311 at r1 (raw file):

Previously, bghal wrote…

Does this need to break up the error checking? Could it be deferred?

I'll reset this in a defer


pkg/cmd/roachtest/tests/inspect_throughput.go line 361 at r1 (raw file):

Previously, bghal wrote…

Nit: Perhaps a switch on jobs.StatusMessage?

Good idea.

@spilchen
Copy link
Contributor Author

TFTR!

bors r+

craig bot pushed a commit that referenced this pull request Oct 21, 2025
155411: restore: prevent scatter of non-empty ranges in online restore r=jeffswenson a=kev-cao

In conventional restore, we prevent the scattering of non-empty ranges by passing in a max size for the `AdminScatter` request. In online restore, while we expect to almost always be scattering empty ranges, due to the fact that the link phase can retry, it is possible to scatter ranges that contain external SSTs. This breaks our golden rule of only scattering empty ranges. This commit teaches online restore to only ever scatter empty ranges.

Fixes: #144599

Release note: None

155653: roachtest: run INSPECT as background job and poll for completion r=spilchen a=spilchen

Long-running INSPECT statements (e.g., >3h) have previously failed due to connection resets, despite no server crash. This likely stems from a missed keep-alive or a client timeout, though the exact cause is unclear.

This change updates the roachtest to:
- Run the INSPECT statement as a background job using a short statement_timeout (5s).
- Poll the job table for job completion instead of waiting synchronously.
- Report progress at 10% intervals to improve visibility without overwhelming the logs.

Fixes #155610

Release note: none

Epic: none

155657: sql: admins now have full access to all external connections r=rafiss a=kev-cao

Previously, if a non-admin user were to create an external connection, that connection would not show up in the output of `SHOW EXTERNAL CONNECTIONS` that was run by an admin.

This patch fixes this bug so that admins can now see/use all external connections.

Fixes: #146773

Release note: Admin users now have full access to all external connections.

Co-authored-by: Kevin Cao <[email protected]>
Co-authored-by: Matt Spilchen <[email protected]>
@craig
Copy link
Contributor

craig bot commented Oct 21, 2025

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 21, 2025

@craig craig bot merged commit 8aa3669 into cockroachdb:master Oct 21, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: inspect/throughput/bulkingest/nodes=12/cpu=8/rows=1000000000/checks=2 failed

3 participants